Query processing for datacenter-scale computers

Author

  • Spyros Blanas
Abstract

Quickly exploring massive datasets for insights requires an efficient data processing platform. Parallel database management systems were originally designed to scale only to a handful of nodes, where each node keeps recent (“hot”) data in memory and has directly-attached hard disk storage for infrequently accessed (“cold”) data. To keep pace with growing data volumes, the research focus has been shifting towards rack-scale architectures. Rack-scale systems combine powerful nodes that have directly-attached storage for “warm” data with hundreds of terabytes of network-attached storage for “cold” data. Exemplars include Oracle Exadata, the IBM PureData System and the Microsoft Analytics Platform System. Processing even larger datasets quickly will inevitably require datacenter-scale computers. Although details of the hardware configurations of commercial datacenters are scarce, one can use scientific computers for data-intensive applications as a proxy. In contrast to rack-scale architectures, the “hot” data path in such systems consists of a few petabytes of memory fragmented across ten thousand nodes, and proprietary network interconnects in unique topologies. The “cold” data path retrieves data through a parallel file system (such as Lustre) from many petabytes of network-attached storage. We posit that the optimizations a DBMS performs individually within a node or across a handful of powerful nodes are insufficient when query processing becomes a datacenter-scale challenge. We argue that the opportunity for database systems research is to lead the transition from the cloud-inspired view of a datacenter as a collection of CPUs towards the view of a datacenter as a single entity that requires careful task orchestration for peak aggregate performance. We identify the following research directions towards realizing this vision.


Similar resources

Towards Efficient Stream Processing in the Wide Area

Stream processing, the processing of large volumes of live data in real time, has been studied for decades [5]. However, recent years have seen an explosion in the number and variety of devices producing data streams, from software logs to handheld phones to aerial cameras, with an estimated 2.5 quintillion bytes of data created each day [1]. It is now commonplace for related data streams to ari...


ArrayBridge: Interweaving declarative array processing with high-performance computing

Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in ...


LOT: A Robust Overlay for Distributed Range Query Processing

Large-scale data-centric services are often handled by clusters of computers that include hundreds of thousands of computing nodes. However, traditional distributed query processing techniques fail to handle the large-scale distribution, peer-to-peer communication and frequent disconnection. In this paper, we introduce LOT, a robust, fault-tolerant and highly distributed overlay network for lar...


Replay Debugging for the Datacenter

Replay Debugging for the Datacenter, by Gautam Deepak Altekar. Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor Ion Stoica, Chair. Debugging large-scale, data-intensive, distributed applications running in a datacenter (“datacenter applications”) is complex and time-consuming. The key obstacle is non-deterministic failures: hard-to-reproduce program misbehaviors...


A new text-mining method for extracting user context information to improve the ranking of search engine results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need effective ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the most similar ones to the user ...




Publication date: 2017